For this assignment you should create your own .Rmd file including the questions and code provided on this page.
You should turn in a compiled HTML document generated by your .Rmd file, which should contain all of your code and the requested output for each question.
Please do not include superfluous output, such as printing large data.frames or vectors, in the file you submit. You should comment-out any code that prints output that isn’t asked for in the question you’re answering. You may lose points if your assignment is overly cluttered with unnecessary output.
Unsupervised learning approaches are frequently used in the analysis of genomic data, which often contain thousands of genetic variables. The dataset NCI60 in the ISLR package contains 6830 gene expression measurements (variables) for 64 cancer cell lines (observations). Each cell line has a known cancer type; however, the goal of this analysis is to explore the extent to which gene expression data can be used to characterize and identify different types of cancer.
library(ISLR)
data("NCI60")
nciData <- NCI60$data
labels <- NCI60$labs
Part A:
Perform PCA on the gene expression data (be sure to standardize), storing your results in an object named nci_pca. Then use plot_ly to visualize the scores of each cell line in the first three principal components using a 3-D scatterplot. These scores are part of the default PCA output, and are contained in the columns nci_pca$x[,1] (scores in PC1), nci_pca$x[,2] (scores in PC2), and nci_pca$x[,3] (scores in PC3). Finally, color the cell lines in your plot using the cancer types contained in the labels vector.
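One possible approach is sketched below, assuming the plotly package is installed; the exact marker styling is up to you.

```r
library(ISLR)
library(plotly)

nciData <- NCI60$data
labels <- NCI60$labs

## Standardize the variables and run PCA
nci_pca <- prcomp(nciData, scale. = TRUE)

## 3-D scatterplot of the first three PC scores, colored by cancer type
plot_ly(x = nci_pca$x[, 1], y = nci_pca$x[, 2], z = nci_pca$x[, 3],
        color = labels, type = "scatter3d", mode = "markers")
```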
Part B:
Construct a scree plot and use it to determine the proportion of variance that is explained by the first 3 principal components. Comparing this proportion with the number of variables in this dataset and looking at the 3-D scatterplot you constructed in Part A, make an argument for the usefulness of applying PCA to these data.
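A minimal sketch of one way to do this in base R, assuming nci_pca is the PCA object from Part A:

```r
## Proportion of variance explained by each principal component
pve <- nci_pca$sdev^2 / sum(nci_pca$sdev^2)

## Scree plot
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")

## Proportion of variance explained by the first 3 PCs
sum(pve[1:3])
```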
Part C:
Construct a distance matrix containing the Euclidean distance between each of the 64 cell lines, then plot your distance matrix using the fviz_dist function. Based upon this plot, do you believe these data can be effectively clustered? Briefly explain.
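A sketch of this step, assuming the factoextra package is installed and that you standardize the genes before computing distances:

```r
library(factoextra)

## Euclidean distances between the 64 (standardized) cell lines
nci_dist <- dist(scale(nciData), method = "euclidean")

## Visualize the distance matrix
fviz_dist(nci_dist)
```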
Part D:
Apply agglomerative clustering to these data and plot the resulting dendrogram. Your dendrogram should label the end nodes using the type of cancer each cell line represents. For example, if you store the AGNES clustering results as ag, you can use the code ag$order.lab <- labels[ag$order] to assign names. You should also use the cex argument in your visualization to reduce the size of the text so that each label is fully visible. In total, your dendrogram should resemble the one shown below:
Part E:
Write a short paragraph interpreting the agglomerative clustering dendrogram you created above. Your paragraph should address:
Part F:
Create a dendrogram like the agglomerative one above, but this time using divisive clustering. Then, briefly comment on the similarities/differences between this dendrogram and the agglomerative one. Specifically address colon cancer in your answer.
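The agglomerative and divisive dendrograms described above can be sketched as follows, assuming the cluster package and the standardized NCI60 data; the cex value may need tuning for your output size.

```r
library(cluster)

## Agglomerative clustering (AGNES)
ag <- agnes(scale(nciData))
ag$order.lab <- labels[ag$order]   ## label leaves by cancer type
pltree(ag, cex = 0.5, main = "AGNES Dendrogram")

## Divisive clustering (DIANA)
di <- diana(scale(nciData))
di$order.lab <- labels[di$order]
pltree(di, cex = 0.5, main = "DIANA Dendrogram")
```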
The Congressional Quarterly Almanac documents the actions of the US Congress, including votes on particular bills. The dataset read in the code chunk below contains the votes of each congressman in the House of Representatives (the 98th Congress, 1984). The dataset is stored at the UC-Irvine Machine Learning Repository, and was originally assembled as part of a UCI student's doctoral dissertation.
pol <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data", header = FALSE)
head(pol)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 republican n y n y y y n n n y ? y y y n y
## 2 republican n y n y y y n n n n n y y y n ?
## 3 democrat ? y y ? y y n n n n y n y y n n
## 4 democrat n y y n ? y n n n n y n y n n y
## 5 democrat y y y n y y n n n n y ? y y y y
## 6 democrat n y y n y y n n n n n n y y y y
The variables in this dataset are defined as follows:
Part A:
Use the daisy function to construct the Gower distance matrix of different congressmen. Then, use the fviz_dist function to graph this distance matrix. Based upon this graph, do you think these data can be clustered? Briefly explain.
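A sketch of this step is shown below. Note that the vote columns are read in as character strings, so they are converted to factors here so daisy treats them as nominal; this treats "?" as its own category, though you might instead choose to recode it as NA.

```r
library(cluster)
library(factoextra)

## Treat the votes (columns V2-V17) as nominal factors; exclude party (V1)
pol_votes <- as.data.frame(lapply(pol[, -1], factor))

## Gower distance between congressmen
gower_dist <- daisy(pol_votes, metric = "gower")

## Visualize the distance matrix
fviz_dist(gower_dist)
```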
Part B:
Apply PAM clustering to these data using the elbow method to select \(k\) (see the clustering lab for code on how to do this using a for-loop). Include a plot of the PAM objective function (vs. \(k\)), as well as your choice of \(k\) and a brief justification.
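A minimal version of the elbow-method loop, assuming gower_dist is the Gower dissimilarity computed in Part A (the range of k values shown here is just one reasonable choice):

```r
## PAM objective (after the swap phase) for a range of k values
k_vals <- 2:8
obj <- numeric(length(k_vals))
for (i in seq_along(k_vals)) {
  obj[i] <- pam(gower_dist, k = k_vals[i])$objective["swap"]
}
plot(k_vals, obj, type = "b", xlab = "k", ylab = "PAM objective")
```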
Part C:
Apply PAM clustering to these data using \(k = 3\), storing your results in an object pam_pol. Then, use the vector pam_pol$clustering (which contains the cluster assignments) to filter the original data into three subsets (name these c1, c2, and c3). Within each of these subsets, use the table function to tally the number of republicans and democrats in that cluster. You may leave your table output “as is”, or you may combine it using either the rbind or cbind functions.
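One way to set this up, again assuming gower_dist is the Gower dissimilarity from Part A:

```r
## PAM with k = 3
pam_pol <- pam(gower_dist, k = 3)

## Split the original data by cluster assignment
c1 <- pol[pam_pol$clustering == 1, ]
c2 <- pol[pam_pol$clustering == 2, ]
c3 <- pol[pam_pol$clustering == 3, ]

## Party composition of each cluster (V1 is the party label)
table(c1$V1)
table(c2$V1)
table(c3$V1)
```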
Part D:
The code provided below calculates the proportion of "y" votes for each of the 16 variables within each cluster (ignore the V1 column, which is party identification). Comparing these proportions across clusters, come up with an explanation of the primary differences between each cluster. You can add your explanation to the bulleted list below the code chunk.
count_yes <- function(x){ sum(x == "y") / length(x) } ## Function to calculate the proportion of "y" votes in a column
## c1, c2, c3 are the filtered datasets corresponding to each cluster
yvotes <- rbind(apply(c1, 2, count_yes),
                apply(c2, 2, count_yes),
                apply(c3, 2, count_yes))
rownames(yvotes) <- c("cluster1", "cluster2", "cluster3")
yvotes
## V1 V2 V3 V4 V5 V6 V7
## cluster1 0 0.1917098 0.5336788 0.1761658 0.84455959 0.95854922 0.93782383
## cluster2 0 0.8879310 0.3534483 0.9310345 0.04310345 0.05172414 0.06896552
## cluster3 0 0.3730159 0.4047619 0.8809524 0.07142857 0.16666667 0.65873016
## V8 V9 V10 V11 V12 V13
## cluster1 0.1502591 0.09326425 0.07772021 0.4818653 0.2383420 0.77720207
## cluster2 0.9310345 0.96551724 0.87068966 0.2500000 0.2068966 0.03448276
## cluster3 0.8095238 0.88888889 0.72222222 0.7460317 0.6349206 0.13492063
## V14 V15 V16 V17
## cluster1 0.86010363 0.9378238 0.07772021 0.5388601
## cluster2 0.07758621 0.1206897 0.69827586 0.6206897
## cluster3 0.26984127 0.4206349 0.61904762 0.7380952
In your analysis, please use the variable descriptions provided at the beginning of this application (rather than simply discussing "V1", etc.).
Part E:
Now apply PAM clustering to these data using the silhouette method to select \(k\) (see the clustering lab for code on how to do this using a for-loop). Include a plot of the average silhouettes (vs. \(k\)), as well as your choice of \(k\) and a brief justification.
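The silhouette-method loop parallels the elbow-method loop from Part B; this sketch assumes gower_dist is the Gower dissimilarity from Part A:

```r
## Average silhouette width for a range of k values
k_vals <- 2:8
avg_sil <- numeric(length(k_vals))
for (i in seq_along(k_vals)) {
  avg_sil[i] <- pam(gower_dist, k = k_vals[i])$silinfo$avg.width
}
plot(k_vals, avg_sil, type = "b", xlab = "k", ylab = "Average Silhouette Width")
```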
Part F:
Use the silhouette function to calculate silhouettes for the 435 congressmen in this dataset. Then, create a new data.frame containing cluster and sil_width from the output of the silhouette function, along with party identifiers from the original dataset. Then use the table function to create a two-way frequency table displaying the frequencies of each party in each cluster.
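A sketch of this step, assuming pam_pol and gower_dist were created in the earlier parts (sil_df is just an illustrative name):

```r
## Silhouettes for the 3-cluster PAM solution
sil <- silhouette(pam_pol$clustering, dist = gower_dist)

sil_df <- data.frame(cluster   = sil[, "cluster"],
                     sil_width = sil[, "sil_width"],
                     party     = pol$V1)

## Two-way frequency table of party by cluster
table(sil_df$party, sil_df$cluster)
```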
Part G:
Finally, filter the new data.frame you created in Part F to only include congressmen with negative silhouettes and then use the table function to create a two-way frequency table displaying the frequencies of each party in each cluster (having negative silhouettes).
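Building on the data.frame from Part F (here called sil_df), the filtering might look like:

```r
## Congressmen whose silhouette width is negative (poorly matched to their cluster)
neg_sil <- sil_df[sil_df$sil_width < 0, ]

## Party by cluster among these poorly matched observations
table(neg_sil$party, neg_sil$cluster)
```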
In homework #3 you used PCA to analyze a subset of variables from the Ames Housing dataset. The code used to create that subset is provided below:
library(dplyr)
AmesHousing <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-viz/data/AmesHousing.csv")
## Select numeric variables of interest
Ames_Subset <- select(AmesHousing, LotFrontage, LotArea, OverallQual, OverallCond, BsmtFinSF1, BsmtUnfSF, GrLivArea, TotRmsAbvGrd, FullBath, GarageArea, WoodDeckSF, ScreenPorch, SalePrice)
## Remove homes w/ missing data
Ames_Subset <- na.omit(Ames_Subset)
Part A:
Use the pairs function to perform exploratory data analysis on the Ames_Subset dataset. In your answer to this question include your scatterplot matrix and three trends you notice from the plot (this could be variable associations, variables with unusual or skewed distributions, outliers, etc.)
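The scatterplot matrix itself is a one-liner; shrinking the points can make patterns easier to see with this many observations:

```r
## Scatterplot matrix of all variables in the subset
pairs(Ames_Subset, pch = 16, cex = 0.3)
```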
Part B:
Fit a multiple linear regression model that predicts SalePrice using all of the other variables in Ames_Subset. Then, provide an interpretation for the estimated effect of TotRmsAbvGrd (total rooms above ground) on sale price.
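A sketch of the model fit (ames_lm is just an illustrative name):

```r
## Regress SalePrice on all remaining variables in the subset
ames_lm <- lm(SalePrice ~ ., data = Ames_Subset)

## The coefficient on TotRmsAbvGrd is the estimated effect to interpret
summary(ames_lm)$coefficients["TotRmsAbvGrd", ]
```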
Part C:
Calculate the in-sample RMSE of the model you fit in Part B. (Hint: See section 3 of the linear regression lab for an example of this).
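Assuming ames_lm is the fitted model from Part B, the in-sample RMSE is the root mean squared residual:

```r
## In-sample RMSE of the fitted model
sqrt(mean(residuals(ames_lm)^2))
```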
Part D:
Use the caret package to estimate the out-of-sample RMSE using 10 repeats of 5-fold cross validation. How does the out-of-sample performance of this model compare to its in-sample performance?
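One way to set this up (fit.lm is an illustrative name; the trainControl object is named fit.control to match the LASSO code chunk later in this assignment):

```r
library(caret)

## 10 repeats of 5-fold cross-validation
fit.control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

set.seed(123)
fit.lm <- train(SalePrice ~ ., data = Ames_Subset,
                method = "lm", trControl = fit.control)
fit.lm$results  ## cross-validated (out-of-sample) RMSE
```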
Part E:
The code below fits a LASSO regression model to these data within the caret framework. Compare the out-of-sample RMSE of this model with the out-of-sample RMSE of the multiple regression model you trained in Part D using the resamples function.
library(caret)
library(glmnet)
## Setup predictor and outcome objects
X <- as.matrix(select(Ames_Subset, -SalePrice))
y <- Ames_Subset$SalePrice
## Fit lasso model
lasso <- glmnet(x = X, y = y)
## Extract lambda sequence (alpha = 1 is the lasso penalty)
lams <- expand.grid(alpha = 1, lambda = lasso$lambda)
## Cross-validate model using the caret framework
## (fit.control is the trainControl object you created in Part D)
set.seed(123)
fit.lasso <- train(SalePrice ~ ., data = Ames_Subset, method = "glmnet",
                   trControl = fit.control, tuneGrid = lams)
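The comparison via resamples might look like the sketch below, where fit.lm stands for whatever name you gave your cross-validated linear model in Part D:

```r
## Collect resampling results from both cross-validated models
model_comp <- resamples(list(lm = fit.lm, lasso = fit.lasso))
summary(model_comp)  ## compare the RMSE distributions
```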
Part F:
Use the output of the coef function provided below to inspect the final LASSO model. How do the estimated effects in this model compare with those in the multiple regression model? What impact do you think these differences have on the out-of-sample performance of each model?
coef(fit.lasso$finalModel, s = fit.lasso$bestTune$lambda)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.139156e+05
## LotFrontage 7.642727e+01
## LotArea 7.249791e-01
## OverallQual 2.648822e+04
## OverallCond 5.057498e+02
## BsmtFinSF1 3.315787e+01
## BsmtUnfSF 8.599130e+00
## GrLivArea 3.978713e+01
## TotRmsAbvGrd .
## FullBath 6.540528e+03
## GarageArea 5.302067e+01
## WoodDeckSF 3.376791e+01
## ScreenPorch 5.889131e+01
Read the article: Science Isn’t Broken, paying special attention to the “Hack your way to scientific glory” app. This article involves hypothesis testing, a topic you may also want to review if you don’t remember it from your intro stats class. Then, respond to the following questions:
Part A: How many different economic models could be displayed using the “Hack your way to scientific glory” app? How do you feel about using \(p\)-values to select a model? How do you feel about using cross-validation to select a model? Briefly explain.
Part B: In the "Same Data, Different Conclusions" panel, do you think it is possible that all of the different models chosen by the different research teams were selected using cross-validation? Briefly explain.
Part C: Sometimes even other scientists are slow to let go of their beliefs despite substantial contradictory evidence. What are two examples of this that the article provides? How can you avoid falling into this behavior in your own future career? Describe these two examples and write a 2-4 sentence reflection.